I will try to write down my thoughts while I perform my data analysis.
import pandas as pd
import numpy as np
from pandas.plotting import lag_plot, autocorrelation_plot
import datetime
from datetime import timedelta
from matplotlib import pyplot
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.tsa.arima_model import ARIMA  # note: moved to statsmodels.tsa.arima.model in newer versions
from plotly.offline import init_notebook_mode, iplot
from plotly import graph_objs as go
from statsmodels.tsa.ar_model import AR  # note: replaced by AutoReg in newer versions
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from math import sqrt
import bisect
from pickle import dump, load
import warnings
warnings.filterwarnings('ignore')
def plotly_actual_vs_predicted(index = None,
actuals = None,
predictions = None,
title = '',
mode = 'lines+markers',
annotations = None):
common_kw = dict(x=index, mode=mode)
xaxis = dict(title=title)
data_actuals = go.Scatter(y=actuals, name='actuals', **common_kw)
data_predictions = go.Scatter(y=predictions, name='predictions', **common_kw)
data = [data_actuals, data_predictions]
layout = dict(title=title, showlegend=True, annotations=annotations, xaxis=xaxis)
fig = dict(data=data, layout=layout)
iplot(fig, show_link=False)
def plot_autocorrelation(values):
pyplot.figure(figsize=(50, 20), edgecolor='b')
pyplot.grid(True)
plot_acf(values.astype(float), ax=pyplot.gca())
pyplot.show()
pyplot.clf()
dataset1 = pd.read_csv('data1.csv', header=None, names=['valor'])
# Check missing data
print(dataset1.isnull().sum())
# Data distribution
dataset1.describe()
Considering that each line is a DAY, we are going to create some logic to define a sequential index of days (starting from today and going back in the calendar). We don't necessarily have to create these dates, but I think it will help to visualize the series.
Note: in production datasets it is very common to find gaps in the data. If that were the case, I would have to work on imputation to fix these gaps. Here, I don't have this issue.
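We don't need it here, but as a hedged sketch (toy data, not from this dataset), filling calendar gaps with the prior observation in pandas could look like this:

```python
import pandas as pd
import numpy as np

# Toy daily series with a missing calendar day (Jan 3) and a missing value (Jan 2)
idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04'])
s = pd.Series([10.0, np.nan, 14.0], index=idx)

# Reindex to a full daily calendar, then forward-fill each gap with the prior value
s_daily = s.asfreq('D').ffill()
```

Forward fill is only one choice; interpolation or model-based imputation may be better when gaps are long.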
# Add date to the dataframe
dataset1['col_dt'] = datetime.datetime.now()
dataset1['linha'] = dataset1.index.astype(int)
dataset1['col_dt'] = dataset1.apply(lambda x: x.col_dt - timedelta(days = x.linha), axis=1)
# Store my TS on "ts1" object
ts1 = pd.Series(index=dataset1.col_dt, data=dataset1.valor.values)
# Check some records
print(ts1.head())
# Sanity check (make sure we did not mess the data)
data = {'col_dt': ts1.index, 'valor': ts1.values}
ts1_df = pd.DataFrame(data=data)
sanity_df = pd.merge(dataset1, ts1_df, on=['col_dt'], how='inner')
print('Sanity test: {} inconsistencies'.format(len(sanity_df[sanity_df['valor_x'] != sanity_df['valor_y']])))
Let's analyse the following components: level, trend, seasonality and noise.
Sometimes we find isolated peaks which cannot be captured by the model. In such cases, we might want to treat those peaks as "outliers".
It's also worth highlighting that we can use additive or multiplicative approaches to decompose the series. Additive models suggest that your time series ADDs each component to explain the target variable:
y(t) = Level + Trend + Seasonality + Noise
while multiplicative models suggest that your time series MULTIPLIES each component to explain y(t):
y(t) = Level * Trend * Seasonality * Noise
In this case, I am using the additive approach because, looking at the data, I don't see huge jumps between values of Y, which makes me think the series is closer to a linear function than a quadratic or exponential one.
# Perform naive decomposition (check the package documentation for more info: https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html)
# We can refer to each result by .trend, .seasonal, .resid and .observed
result = seasonal_decompose(ts1, model='additive')
result.plot()
pyplot.show()
Looking at these components, we cannot see any trend or seasonality in the series. The pattern is constant throughout. Such behavior makes me think this is a stationary series, since the average value of the series remains constant. Let's check.
stationary_test = adfuller(ts1, autolag='AIC')
stationary_test
For each of those confidence levels (99%, 95% and 90%), if the critical value is greater than the test statistic (-80.59), then the series is stationary. In this case, the series is stationary with 99% confidence.
Finally, we might want to check autocorrelation to guarantee this is not white noise (which would not make sense, based on the evidence so far).
Checking the ACF plot:
plot_autocorrelation(ts1.values)
Because we see many points outside the blue zone, we can assume the series is autocorrelated. If you are not sure, we can take a look at other visualizations. For example, pandas has a built-in function that plots the t and t+1 lag variables with minimal effort in terms of data transformation.
lag_plot(ts1)
This is definitely highly correlated. One last test is actually computing the Pearson correlation coefficient.
values = pd.DataFrame(ts1.values)
dataframe = pd.concat([values.shift(1), values], axis=1)
dataframe.columns = ['t', 't+1']
result = dataframe.corr()
result
Correlation of 99%.
This series is autocorrelated and stationary (no trend or seasonal components). We should be good to go ahead and forecast.
We should match our forecasting algorithm to the characteristics of the series. Very common choices are exponential smoothing methods for non-stationary series and ARIMA for stationary series.
Note: there are many extensions of exponential smoothing methods... to choose the best approach for a given problem, we should again look at the trend and seasonality components. It is also possible to explore RNNs for forecasting, since this type of neural net is used to solve sequence problems.
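For reference, the simplest member of that family (simple exponential smoothing, sketched here by hand with an assumed alpha, not a statsmodels call) blends each new observation with the previous smoothed level:

```python
def exp_smooth(series, alpha=0.3):
    """Simple exponential smoothing: level = alpha*y + (1-alpha)*previous level."""
    level = series[0]
    smoothed = [level]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
        smoothed.append(level)
    return smoothed
```

Extensions such as Holt's method add a trend term, and Holt-Winters adds a seasonal term on top.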
Here, just because I have a highly autocorrelated dataset, I will give autoregression (AR) models a try. This is a linear regression model, but the features used are actually your lag variables. The AR class has a built-in method to select the optimal number of lag variables as features.
yhat = b0 + (b1 * X1) + (b2 * X2) + ... + (bn * Xn), where b0 is the intercept, b1..bn are the learned coefficients, and X1..Xn are the lag observations (the values at t-1, t-2, ..., t-n).
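As a sketch with made-up numbers (the helper name is mine, not part of the AR class), the one-step forecast is just that weighted sum over the most recent lags:

```python
import numpy as np

def ar_one_step(params, history):
    """One-step AR forecast: intercept b0 plus each lag coefficient times the
    corresponding most-recent observation.

    params  -- [b0, b1, ..., bn], the same layout as model_fit.params
    history -- observed series; the last n values feed the n lags
    """
    b0, coefs = params[0], np.asarray(params[1:])
    lags = np.asarray(history)[-len(coefs):][::-1]  # most recent observation first
    return b0 + coefs.dot(lags)

# Toy example: yhat = 1 + 0.5*y[t-1] + 0.2*y[t-2]
yhat = ar_one_step([1.0, 0.5, 0.2], [3.0, 4.0])
```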
# split the data into train and test (in this case I am selecting the last 25% of the data for testing)
number_points = len(ts1.values)
last_points_for_testing = int(number_points * 0.25)
print ('-------Using {} rows for testing'.format(last_points_for_testing))
data_points = ts1.values
train, test = data_points[1:len(data_points)-last_points_for_testing], data_points[len(data_points)-last_points_for_testing:]
# train AR model
model = AR(train)
model_fit = model.fit()
print('Number of lags: {}'.format(model_fit.k_ar))
print('Coefficients: {}'.format(model_fit.params))
# make predictions of the test period
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)
for i in range(len(predictions)):
print('predicted=%f, expected=%f' % (predictions[i], test[i]))
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
plotly_actual_vs_predicted(index=ts1.index[-last_points_for_testing:], actuals=test, predictions=predictions, title='Actuals vs Predictions', mode='lines')
It might look overfitted, but it is not. Because of the very high autocorrelation of this dataset, the AR model was able to capture the exact distribution of the data using 30 lags.
dataset2 = pd.read_csv('data2.csv', header=None, names=['valor'])
# Add date to the dataframe
dataset2['col_dt'] = datetime.datetime.now()
dataset2['linha'] = dataset2.index.astype(int)
dataset2['col_dt'] = dataset2.apply(lambda x: x.col_dt - timedelta(days = x.linha), axis=1)
# Store the time series in the "ts2" object
ts2 = pd.Series(index=dataset2.col_dt, data=dataset2.valor.values)
# Check some records
print(ts2.head())
# Sanity check to make sure our series did not mix information from different dates
data = {'col_dt': ts2.index, 'valor': ts2.values}
ts2_df = pd.DataFrame(data=data)
sanity_df = pd.merge(dataset2, ts2_df, on=['col_dt'], how='inner')
print('Sanity test: {} inconsistencies found'.format(len(sanity_df[sanity_df['valor_x'] != sanity_df['valor_y']])))
# Perform naive decomposition
result = seasonal_decompose(ts2, model='additive')
result.plot()
pyplot.show()
This series presents some patterns in its trend. There was a decreasing trend from the beginning of the series to the halfway point. Then we observe a quick increase followed by a slow decay, and finally another increasing pattern. The decreasing/increasing movements are not exactly "periodic enough" to associate with a seasonal component.
From a business perspective, I tend to think this is related to non-periodic triggers (maybe the creation of new lines of business, new products, etc.), which might change the behavior. However, in a global view, there is a decaying movement.
Because the trend is NOT NULL, we might have a non-stationary series. Let's check next.
stationary_test = adfuller(ts2, autolag='AIC')
stationary_test
For each of those confidence levels (99%, 95% and 90%), if the critical value is greater than the test statistic (0.014), then the series is stationary. In this case, every critical value is smaller than the test statistic, so the series is NOT stationary.
This time, let's take a look at the autocorrelation plot using a pandas function.
autocorrelation_plot(ts2.values)
Since we have many points outside the dashed lines, it looks like there is a lot of correlation between lags. Let's again check the Pearson correlation between t and t+1.
values = pd.DataFrame(ts2.values)
dataframe = pd.concat([values.shift(1), values], axis=1)
dataframe.columns = ['t', 't+1']
result = dataframe.corr()
result
This time we are going to use ARIMA, which is a generalization of the previous model we used (AR), with some other components:
Each of these components is set explicitly through parameters, usually referred to as ARIMA(p,d,q), where p is the number of autoregressive lags, d is the degree of differencing, and q is the size of the moving-average window.
# split the data into train and test (in this case I am selecting the last 25% of the data for testing)
number_points = len(ts2.values)
last_points_for_testing = int(number_points * 0.25)
print ('-------Using {} rows for testing'.format(last_points_for_testing))
data_points = ts2.values
train, test = data_points[1:len(data_points)-last_points_for_testing], data_points[len(data_points)-last_points_for_testing:]
Note: because this is a non-stationary TS, we have to apply differencing (we will use order 1).
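To make the idea concrete (toy numbers, not from this dataset), first-order differencing replaces each level with the change from the previous step, which removes a trend; the transform is also invertible:

```python
import numpy as np

# Toy trending series: differencing (the d=1 in the ARIMA order) removes the trend
levels = np.array([10.0, 12.0, 15.0, 19.0])
diff1 = np.diff(levels)  # change between consecutive steps

# Inverting the transform: cumulate the changes back onto the first level
restored = np.concatenate(([levels[0]], levels[0] + np.cumsum(diff1)))
```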
# fit model
model = ARIMA(train, order=(5,1,0))
model_fit = model.fit(disp=0)
# summary of fit model
print(model_fit.summary())
# line plot of residuals
residuals = pd.DataFrame(model_fit.resid)
residuals.plot()
pyplot.show()
# density plot of residuals
residuals.plot(kind='kde')
pyplot.show()
# summary stats of residuals
print(residuals.describe())
Because the mean of my residuals is very close to zero and the distribution looks Gaussian, I don't think there is any strong bias behind the scenes. I will go ahead and perform my predictions.
However, usually we would spend more time on this step to find the best set of parameters for this model (probably using grid search).
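A minimal grid-search skeleton could look like the sketch below; the evaluate callback is my own placeholder for fitting ARIMA(train, order=(p, d, q)) and returning the test RMSE, so the search itself stays generic:

```python
import itertools

def grid_search_orders(evaluate, ps=(0, 2, 5), ds=(0, 1), qs=(0, 1, 2)):
    """Try every (p, d, q) combination and keep the one with the lowest score."""
    best_order, best_score = None, float('inf')
    for order in itertools.product(ps, ds, qs):
        try:
            score = evaluate(*order)
        except Exception:  # some orders fail to converge; just skip them
            continue
        if score < best_score:
            best_order, best_score = order, score
    return best_order, best_score

# Dummy evaluator just to show the mechanics (not a real ARIMA fit)
best, score = grid_search_orders(lambda p, d, q: abs(p - 5) + abs(d - 1) + q)
```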
Now we will discuss a very important and particular aspect of TS modeling. Once we finish our modeling phase and we want to go to production predictions, we have to think about how the pipeline will work.
In regular ML models, we train the model just ONCE and then we expect to keep using the same model UNTIL signs of model drift appear. With TS, it is usually a good idea to keep retraining your model as soon as you receive more information. The reason is: the further ahead you predict, the more your accuracy tends to decrease.
This issue did not show up on our first dataset only because it followed a very strictly linear behavior, so even with very long prediction horizons we lost no accuracy. Here, on the second dataset, we can clearly see the accuracy decay.
# make predictions of the test period
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False, typ='levels')
for i in range(len(predictions)):
print('predicted=%f, expected=%f' % (predictions[i], test[i]))
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
Based on this evidence, retraining your model every time you receive more data is a better approach.
Quick note: based on some prior experience, I have received requests from users to make predictions for more than one time step ahead. We have to communicate the accuracy behavior of a TS well: if you report an average RMSE, users should be informed that the metric tends to be better at the first time steps and worse at the later ones.
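To make that point concrete (toy numbers, not from this dataset), reporting the error per forecast step instead of a single averaged RMSE makes the decay with the horizon visible:

```python
import numpy as np

# Toy multi-step forecast: errors typically grow with the horizon
actuals = np.array([10.0, 11.0, 12.0, 13.0])
forecast = np.array([10.1, 11.4, 13.0, 15.0])
per_step_error = np.abs(actuals - forecast)  # one error per step ahead
```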
If you want to get around this issue, you can do daily retraining and walk-forward validation as follows:
history = [x for x in train]
predictions = list()
confidence_lower = list()
confidence_upper = list()
# walk-forward validation
for t in range(len(test)):
model = ARIMA(history, order=(5,1,0))
model_fit = model.fit(disp=0)
#output = model_fit.forecast()
output, stderr, conf = model_fit.forecast(alpha = 0.05)
yhat = output[0]
predictions.append(yhat)
confidence_lower.append(conf[0][0])
confidence_upper.append(conf[0][1])
obs = test[t]
history.append(obs)
print('predicted=%f, expected=%f' % (yhat, obs))
# evaluate forecasts
rmse = sqrt(mean_squared_error(test, predictions))
print('Test RMSE: %.3f' % rmse)
# plot forecasts against actual outcomes
pyplot.plot(test)
pyplot.plot(predictions, color='red')
pyplot.show()
Now we have a much better model (RMSE: 16.427) and no communication challenge. Let's improve the user experience with a better graph:
plotly_actual_vs_predicted(index=ts2.index[-last_points_for_testing:].sort_values(), actuals=test, predictions=predictions, title='Actuals vs Predictions')
And now let's add the 95% confidence intervals (I did not add them for the first dataset because it is too linear).
common_kw = dict(x=ts2.index[-last_points_for_testing:].sort_values(), mode='lines+markers')
xaxis = dict(title='Predictions with confidence interval')
data_actuals = go.Scatter(y=test, name='actuals', **common_kw)
data_predictions = go.Scatter(y=predictions, name='predictions', **common_kw)
upper_bound = go.Scatter(y=confidence_upper, name='upper', mode='lines', x=ts2.index[-last_points_for_testing:].sort_values(), marker=dict(color="#444"), line=dict(width=0), fillcolor='rgba(68, 68, 68, 0.3)', fill='tonexty')
lower_bound = go.Scatter(y=confidence_lower, name='lower', mode='lines', x=ts2.index[-last_points_for_testing:].sort_values(), marker=dict(color="#444"), line=dict(width=0))
data = [lower_bound, upper_bound, data_actuals, data_predictions]
layout = dict(title='Predictions with confidence interval', showlegend=True, xaxis=xaxis)
fig = dict(data=data, layout=layout)
iplot(fig, show_link=False)
dataset3 = pd.read_csv('data3.csv')
dataset3['bond_id'] = dataset3['bond_id'].astype(str)
# There are some missing values in curve_based_price_last1. Because this feature is a TS, I am going to
# use the prior observation to fill the missing values
df1 = dataset3.curve_based_price_last1.to_frame()
df1['id'] = df1.index
df2 = dataset3.curve_based_price_last1.shift(1).to_frame()
df2['id'] = df2.index
df_merge = pd.merge(df1, df2, on=['id'], how='inner')
df_merge['curve_based_price_last1'] = np.where(df_merge['curve_based_price_last1_x'].isnull(), df_merge['curve_based_price_last1_y'], df_merge['curve_based_price_last1_x'])
dataset3['curve_based_price_last1'] = df_merge['curve_based_price_last1']
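For reference, the shift-and-merge construction above is equivalent to coalescing the column with its own shift(1); a small sketch on a toy column (names are mine, not from the dataset):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'price': [101.2, np.nan, 99.8, np.nan]})

# Where price is null, take the previous row's value (same effect as the merge above)
df['price_filled'] = df['price'].fillna(df['price'].shift(1))
```

Like the merge approach, this only fills single-row gaps; consecutive missing values would need ffill().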
xaxis = dict(title='Predictions with confidence interval')
data_actuals = go.Scatter(y=dataset3.tail(500)['trade_price'], name='trade price', mode='lines+markers')
data = [data_actuals]
layout = dict(title='Predictions with confidence interval', showlegend=True, xaxis=xaxis)
fig = dict(data=data, layout=layout)
iplot(fig, show_link=False)
It looks like we have some outliers. We will come back to this topic later if needed. Let's decompose the series.
# Add date to the dataframe
dataset3['col_dt'] = datetime.datetime.now()
dataset3['linha'] = dataset3.index.astype(int)
dataset3['col_dt'] = dataset3.apply(lambda x: x.col_dt - timedelta(hours = x.linha), axis=1)
# Store my TS in the "ts3" object
ts3 = pd.Series(index=dataset3.col_dt, data=dataset3.trade_price.values)
# Check some records
print(ts3.head())
# Perform naive decomposition (check the package documentation for more info: https://www.statsmodels.org/stable/generated/statsmodels.tsa.seasonal.seasonal_decompose.html)
# We can refer to each result by .trend, .seasonal, .resid and .observed
result = seasonal_decompose(ts3, model='additive')
result.plot()
pyplot.show()
Now, we will try to understand the most relevant variables.
While I was doing some research about this topic, I found this interesting package: https://tsfresh.readthedocs.io/en/latest/. I will take a look at it another time; I just thought you might find it interesting too. Here, I will try to understand feature importance using my regression coefficients.
We have one target variable "trade_price" and a bunch of independent variables. We got some instructions to give special attention to the "bond_id" variable.
For this example, I am using SelectKBest from scikit-learn. This function scores feature importance based on the F-value between label and feature for regression tasks.
features = [
'bond_id',
'weight',
'current_coupon',
'time_to_maturity',
'is_callable',
'reporting_delay',
'trade_size',
'trade_type',
'curve_based_price',
'received_time_diff_last1',
'trade_price_last1',
'trade_size_last1',
'trade_type_last1',
'curve_based_price_last1'
]
df_features = dataset3[features].copy()
# f_regression needs numeric inputs, so cast bond_id back to numeric for this test
df_features['bond_id'] = df_features['bond_id'].astype(float)
# Checking some variable importance
selector = SelectKBest(score_func=f_regression, k='all')
fit = selector.fit(df_features, dataset3['trade_price'])
# Show results
varImp = dict()
for variable, score, topX in zip(df_features.columns, fit.scores_, selector.get_support()):
varImp[variable] = score
pd.DataFrame.from_dict(varImp, orient='index').to_csv('varImp.csv')
varImp
I have done some manual work to highlight feature importance based on these results. All of these features carry some level of importance; I built this heat map to show the most important ones (from top to bottom):

I don't have any business background here, but based on some context that I got from the OLX team (the test itself), it makes a lot of sense. Let's see why:
In fact, it might be worth exploring all of them, but I will stop my arguments here due to my limited time. There is a question around the "bond_id" variable. I will try to analyse its levels.
First, we will create a dataset of bonds with enough data (at least more than one record). This is needed just because I will work with a stratified sample of the data. By the way, ideally I should not be using the train and test splits to do feature engineering, but I will go ahead with this approach.
aggregations = dict()
aggregations['id'] = 'count'
df_bond = dataset3.groupby(['bond_id'], as_index=False).agg(aggregations)
df_bond = df_bond.rename(columns={'id': 'count'})
df_bond = df_bond[df_bond['count'] > 1]
df_bond.head()
Now, I will limit my dataset3 to those bonds.
# Filter dataset3
dataset3_with_enough_bond = pd.merge(dataset3, df_bond, on=['bond_id'], how='inner')
# Specify the feature space
features = [
'bond_id',
'weight',
'current_coupon',
'time_to_maturity',
'is_callable',
'reporting_delay',
'trade_size',
'trade_type',
'curve_based_price',
'received_time_diff_last1',
'trade_price_last1',
'trade_size_last1',
'trade_type_last1',
'curve_based_price_last1'
]
df_features = dataset3_with_enough_bond[features]
In the next steps I am going to perform a one-hot encoding operation on the dataset, using the bond_ids as features. That will give me a very sparse matrix. To avoid running out of memory, I will work with a stratified sample of the data.
X_train, X_test, y_train, y_test = train_test_split(df_features, dataset3_with_enough_bond['trade_price'], test_size = 0.5, stratify=df_features.bond_id)
df_features = X_train.copy()
Finally, I will go ahead and create my one-hot encoded variables using my sample dataset.
# Create a key column just to facilitate the final merge in the end of this section
df_features['id'] = df_features.index
###
### PART 1 - CREATE A FACTOR VARIABLE (LABEL ENCODER)
###
le_classificacao = preprocessing.LabelEncoder()
le_classificacao.fit(df_features.bond_id)
df_features['le_classificacao'] = le_classificacao.transform(df_features.bond_id)
dump(le_classificacao, open('le_classificacao.sav', 'wb'))
###
### PART 2 - DUMMIES
###
# Create a new dataset with key + input variable
dataset_new = df_features[['id', 'le_classificacao']]
# OneHotEncoder (note: the categorical_features argument was removed in newer
# scikit-learn versions, where ColumnTransformer is used instead)
encoder = OneHotEncoder(categorical_features = np.array([False, True]), dtype=bool, sparse=True)
encoder.fit(dataset_new)
# Save for future use
dump(encoder, open('encoder.sav', 'wb'))
# Prepare column names
colnames = list(le_classificacao.classes_)
colnames.append('id')
# Execute the transformation in the dataset
results = encoder.transform(dataset_new)
dataset_encode = pd.DataFrame(results.toarray(), columns=colnames)
# Convert data types (keep the dummy columns numeric so they can feed f_regression later)
dataset_encode['id'] = dataset_encode['id'].astype(float)
# Merge final results
dataset_merge = pd.merge(df_features, dataset_encode, on='id', how='inner')
# Check results
dataset_merge.head()
Doing some cleanup before testing variable importance:
df_one_hot_encoded = dataset_merge.drop(['id', 'bond_id', 'le_classificacao'] ,axis=1)
df_one_hot_encoded.dtypes
Checking variable importance again, this time considering my one-hot encoded variables:
selector = SelectKBest(score_func=f_regression, k='all')
fit = selector.fit(df_one_hot_encoded, y_train)
# Show results
varImp = dict()
for variable, score, topX in zip(df_one_hot_encoded.columns, fit.scores_, selector.get_support()):
varImp[variable] = score
pd.DataFrame.from_dict(varImp, orient='index').to_csv('varImp.csv')
varImp

It looks like a few bonds are more important than others.
I will now prepare my modeling dataset. Instead of using the sklearn encoder, I decided to create my dummy variables manually.
modeling_df = dataset3[['trade_price', 'bond_id', 'trade_price_last1', 'curve_based_price', 'curve_based_price_last1', 'is_callable']].copy()
modeling_df['is_348003'] = np.where(modeling_df['bond_id'] == '348003', '1', '0')
modeling_df['is_195002'] = np.where(modeling_df['bond_id'] == '195002', '1', '0')
modeling_df['is_callable'] = modeling_df['is_callable'].astype(str)
modeling_df = modeling_df.drop(['bond_id'], axis=1)
modeling_df.head()
# Separate target and features
target = modeling_df['trade_price']
datasetFiltered = modeling_df.drop('trade_price', axis=1)
X = datasetFiltered.values
Y = target
# split the data into train and test (in this case I am selecting the last 25% of the data for testing)
number_points = len(X)
last_points_for_testing = int(number_points * 0.25)
print ('-------Using {} rows for testing'.format(last_points_for_testing))
X_train, X_test = X[1:len(X)-last_points_for_testing], X[len(X)-last_points_for_testing:]
Y_train, Y_test = Y[1:len(Y)-last_points_for_testing], Y[len(Y)-last_points_for_testing:]
# Create the model
baseline = LinearRegression()
baseline.fit(X_train, Y_train)
# r2
print(baseline.score(X_train, Y_train))
# r2 adjusted
print (1 - (1-baseline.score(X_train, Y_train))*(len(Y_train)-1)/(len(Y_train)-X_train.shape[1]-1))
# Test against the test data
predictions = baseline.predict(X_test)
# MSE
mean_squared_error(Y_test, predictions)
So far my metrics look OK. I will finish by looking at the residuals.
pyplot.plot(Y_test - predictions)
pd.DataFrame(Y_test - predictions).plot(kind='density')
They look random and normally distributed. Good sign, then.
Let's finally visualize the results.
plotly_actual_vs_predicted(actuals=Y_test[-500:], predictions=predictions[-500:], title='Actuals vs Predictions')
Since our predictions look good, I will package this model and make it available through my Flask API.
Thanks
filename = 'model.sav'
dump(baseline, open(filename, 'wb'))